suppressPackageStartupMessages({
library(reticulate)
library(tidyverse)
library(highcharter)
library(xts)
library(TTR)
library(forecast)
library(Metrics)
library(rlang)
library(prophet)
library(rstudioapi)
})
knitr::opts_chunk$set(echo = TRUE)
# Set working dir to file location
if(Sys.getenv("RSTUDIO") == "1") {
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
} else {
setwd(getSrcDirectory(function() {})[1]) # getSrcDirectory() needs a function argument
}
# Python resources
use_virtualenv("../venv")
source_python("../python/functions.py")
# Highcharter custom theme
thm <- hc_theme(
colors = c("red", "green", "blue"),
chart = list(
backgroundColor = "#0A1741"
),
title = list(
style = list(
color = "#dddddd",
fontFamily = "Helvetica"
)
),
subtitle = list(
style = list(
color = "#dddddd",
fontFamily = "Helvetica"
)
),
legend = list(
itemStyle = list(
fontFamily = "Helvetica",
color = "black"
),
itemHoverStyle = list(
color = "#dddddd"
)
),
yAxis = list(
gridLineWidth = 0.5,
labels = list(style = list(color = "#dddddd"))
),
xAxis = list(
labels = list(style = list(color = "#dddddd"))
)
)
Using daily Nvidia (stock symbol NVDA) closing price data, I will develop functionality for plotting OHLC stock price data, overlaying exponentially weighted moving average (EWMA) data, and decomposing the time series. I will then split the data into training, validation, and test sets so I can develop exponential smoothing, ARIMA, Prophet, kNN, and LSTM-RNN models for time series forecasting.
The intent of this analysis is to develop models for predicting the price of NVDA five days out to help inform weekly options trading strategies.
I will use the reticulate R package to interface with the Python code I wrote in wrangle.html to fetch stock symbol data. I will only keep dates up to April 29th, 2022 so the results are reproducible.
stock_symbol <- "NVDA"
nvda <- get_stonk(stock_symbol)
# Need to convert to dataframe
df_nvda <- nvda$data %>%
py_to_r() %>% as.data.frame() %>%
rownames_to_column("date") %>%
mutate(date = as.Date(date, format = "%Y-%m-%d")) %>%
filter(date <= "2022-04-29")
df_nvda %>% tail()
## date Open High Low Close Volume
## 5853 2022-04-22 203.03 204.86 195.00 195.15 62471300
## 5854 2022-04-25 192.02 199.45 190.96 199.02 64156600
## 5855 2022-04-26 197.18 197.88 186.70 187.88 65314300
## 5856 2022-04-27 185.98 191.67 182.90 184.15 49946000
## 5857 2022-04-28 189.67 200.37 184.90 197.82 57032700
## 5858 2022-04-29 194.02 201.28 185.17 185.47 50001100
Now let’s convert the data frame to an xts time series object. I will also drop the volume data for now.
xts_nvda <- xts(df_nvda[-c(1,6)], order.by = df_nvda$date, frequency = 365.25) %>% na.approx(na.rm = TRUE)
head(xts_nvda)
## Open High Low Close
## 1999-01-22 0.4018236 0.4484636 0.3563794 0.3767094
## 1999-01-25 0.4066070 0.4209579 0.3767095 0.4161746
## 1999-01-26 0.4209577 0.4293294 0.3779051 0.3838852
## 1999-01-27 0.3850812 0.3946487 0.3635544 0.3826895
## 1999-01-28 0.3826894 0.3850810 0.3791010 0.3814936
## 1999-01-29 0.3814936 0.3826894 0.3635543 0.3635543
I will now create a candlestick OHLC plot using an R wrapper of the highcharts library. My intent is to overlay the EWMA analysis I did in Python such that the OHLC data is shown first and the user can then toggle any of the four EWMA windows on top of the candlestick plot. Here I will only show the 12- and 26-day EWMA.
To do this I must first convert the EWMA data to R-native xts objects.
ewma <- data.frame(
date = nvda$ewma$`12` %>%
py_to_r() %>% as.data.frame() %>%
rownames_to_column("date") %>%
mutate(date = as.Date(date, format = "%Y-%m-%d")) %>%
dplyr::select(date) %>% as.vector(),
ewma12 = nvda$ewma$`12` %>% py_to_r() %>% as.vector(),
ewma26 = nvda$ewma$`26` %>% py_to_r() %>% as.vector(),
ewma50 = nvda$ewma$`50` %>% py_to_r() %>% as.vector(),
ewma200 = nvda$ewma$`200` %>% py_to_r() %>% as.vector()
) %>%
mutate(
ewma12 = xts(ewma12, order.by = date, frequency = 365.25) %>% na.approx(na.rm = TRUE),
ewma26 = xts(ewma26, order.by = date, frequency = 365.25) %>% na.approx(na.rm = TRUE),
ewma50 = xts(ewma50, order.by = date, frequency = 365.25) %>% na.approx(na.rm = TRUE),
ewma200 = xts(ewma200, order.by = date, frequency = 365.25) %>% na.approx(na.rm = TRUE)
) %>%
filter(date <= "2022-04-29")
ewma <- xts(dplyr::select(ewma, -date), order.by = ewma$date)
ewma %>% tail()
## ewma12 ewma26 ewma50 ewma200
## 2022-04-22 220.5298 233.4014 241.5574 235.1118
## 2022-04-25 217.2206 230.8546 239.8893 234.7527
## 2022-04-26 212.7067 227.6713 237.8497 234.2863
## 2022-04-27 208.3133 224.4475 235.7438 233.7874
## 2022-04-28 206.6990 222.4751 234.2566 233.4296
## 2022-04-29 203.4330 219.7340 232.3434 232.9523
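As a sanity check on the Python-derived values, the same EWMA series can be computed directly in R with TTR::EMA (TTR is already loaded). EMA uses the smoothing ratio 2/(n + 1), the same ratio pandas uses for ewm(span = n), though the first few values may differ depending on how each library seeds the average — a cross-check sketch, not part of the pipeline:

```r
# Cross-check the Python EWMA values against TTR's EMA implementation.
# EMA(x, n) smooths with ratio 2/(n + 1); early values can differ from
# pandas depending on initialization.
ewma_check <- cbind(EMA(xts_nvda$Close, n = 12), EMA(xts_nvda$Close, n = 26))
colnames(ewma_check) <- c("ewma12", "ewma26")
tail(ewma_check)
```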
Now I can plot my EWMA analysis on top of a candlestick plot using highcharter.
xts_nvda %>%
hchart(upColor = "#67e826", color = "#db003e", lineColor = "#696969", name = "OHLC") %>%
hc_title(text = paste0(stock_symbol, " - EWMA analysis")) %>%
hc_add_theme(thm) %>%
hc_tooltip(crosshairs = FALSE, valueDecimals = 2) %>%
hc_add_series(data = ewma$ewma12, color = "#9671e5", name = "12 day:") %>%
hc_add_series(data = ewma$ewma26, color = "#e571a9", name = "26 day:")
To explore the data above you can select a time frame in the upper left and then slide the window along the scroll bar on the bottom.
Exploring the figure above, we can see that, generally, when the 12-day EWMA is above the 26-day EWMA there is bullish price action, the opposite holds when it is below, and when the two values are very close there is little price movement. While this is not a precise forecast, it does give us useful information.
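The crossover heuristic described above can be made explicit. A minimal sketch using the ewma object built earlier (an illustration of the heuristic, not a trading rule):

```r
# +1 when the 12-day EWMA sits above the 26-day EWMA (bullish),
# -1 when below (bearish).
signal <- sign(ewma$ewma12 - ewma$ewma26)
# Dates where the sign flips, i.e. where the two EWMAs cross:
crossings <- index(ewma)[which(diff(as.numeric(signal)) != 0) + 1]
tail(crossings)
```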
I will now explore the methodology for visualizing the decomposition of the Closing price data.
Let’s start with a decomposition of the data before we move on to any transformations.
close_ts <- ts(xts_nvda$Close, frequency = 365.25)
decompose(close_ts) %>% plot()
We can clearly see that the data is not stationary and has a stark trend upwards at the end. There is also a repeating seasonality component and the last quarter or so of the data displays increasing variance in the residuals.
Let’s look at its Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots.
close_ts %>% ggtsdisplay(lag.max = 20, main = paste0(stock_symbol, " - closing price series, ACF and PACF"))
From the PACF we can see that only the first lag is strongly significant, so differencing at lag 1 is most appropriate.
Below is a generalized function to grid search for the best transformation function and differencing lag for producing a stationary series. I’ll define the “best” stationary transformation as the one whose rolling variance (computed over a window spanning 20% of the data) has the lowest standard deviation.
The default settings only use two lags and the transformation functions below.
# Return either best trans data or entire grid search
determine_trans <- function(xts_obj, lags=1:2, prop_win=0.2, all_=FALSE) {
lambda = BoxCox.lambda(xts_obj, method = "loglik")
boxcox <- function(x, lambda) {
lambda = BoxCox.lambda(x, method = "loglik")
BoxCox(x, lambda)
}
inv_boxCox <- function(x, lmb = lambda) { # default lmb resolves to the enclosing lambda
if(lmb == 0) exp(x) else (lmb*x + 1)^(1/lmb)
}
funcs <- c(function(x){x}, log, sqrt, boxcox)
names(funcs) <- c("no_trans", "log", "sqrt", "boxcox")
inv_funcs <- c(function(x){x}, exp, function(x){x**2}, inv_boxCox)
names(inv_funcs) <- c("no_trans", "exp", "square", "inv_boxcox")
min_sd_val <- Inf
min_sd <- ""
all_trans <- list()
names_list <- c()
for(l in lags) {
for(i in seq_along(funcs)) {
name = names(funcs)[i]
func = funcs[[i]]
inv_func = inv_funcs[[i]]
trans <- xts_obj %>% func
diff <- diff(trans, lag = l)
diff[l:1,1] <- 0 # Impute initial diff'ed values to 0
roll_var <- na.omit(runSD(diff, n = round(nrow(diff)*prop_win)))
roll_var_sd <- sd(roll_var)
dat <- list(
"name" = paste0("l", l, "_", name),
"lag" = l,
"func" = func,
"inv_func" = inv_func,
"og" = xts_obj,
"trans" = trans,
"diff" = diff,
"decomp" = decompose(ts(diff, frequency = 365.25)),
"lambda" = lambda
)
assign(paste0("l", l, "_", name), dat)
all_trans[[length(all_trans)+1]] <- dat
names_list <- c(names_list, paste0("l", l, "_", name))
if(roll_var_sd < min_sd_val) {min_sd_val <- roll_var_sd; min_sd <- paste0("l", l, "_", name)}
}
}
names(all_trans) <- names_list
if(!all_) {
return(get(min_sd))
}
all_trans
}
trans_ts <- determine_trans(xts_nvda[,4])
trans_ts$name
## [1] "l1_log"
It seems performing a log transformation and using a lag of 1 results in the most stationary series as I have defined it.
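Note that the selected “l1_log” transform has a familiar interpretation: differencing the log price at lag 1 yields the daily log return, a standard variance-stabilizing choice for price series. A quick check:

```r
# diff(log(P), lag = 1) is the daily log return r_t = log(P_t) - log(P_t-1).
# Aside from the imputed leading zero, this should match trans_ts$diff.
log_ret <- diff(log(xts_nvda$Close), lag = 1)
all.equal(as.numeric(log_ret[-1]), as.numeric(trans_ts$diff[-1]))
```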
Let’s see it decomposed.
trans_ts$decomp %>% plot()
It certainly looks more stationary now.
Let’s look at the transformed and differenced series as well as its ACF and PACF.
trans_ts$diff %>% ggtsdisplay(lag.max = 20, main = paste0(stock_symbol, " - Log transformed with lag 1 Series, ACF and PACF"))
This series certainly looks more stationary, as there is no clear trend and the ACF/PACF lags are basically all non-significant. There is one lag in each that reaches roughly the 0.05 significance bounds, but with 20 lags shown we would expect about one to do so by chance.
Because we are using time series data, we should not randomize our splits but instead take the first chunk for training and validate/test on the later chunks. I will use a 70/20/10 percent split for training, validation, and testing respectively.
split_data <- function(data, og_data, train_p=0.7, validate_p=0.2) {
if(sum(train_p, validate_p) >= 1) stop("sum of train_p and validate_p is equal to or greater than 1. No test set...")
n <- nrow(data)
a <- round(n*train_p)
b <- round(n*validate_p)
train <- data[1:a]
validate <- data[(a+1):(a+b)]
test <- data[(a+b+1):n]
og_train <- og_data[1:a]
og_validate <- og_data[(a+1):(a+b)]
og_test <- og_data[(a+b+1):n]
list(
"train" = train, "validate" = validate, "test" = test,
"og_train" = og_train, "og_validate" = og_validate, "og_test" = og_test
)
}
split_data <- split_data(trans_ts$diff, trans_ts$og)
# Quickly plot to show the data is split correctly
plot(split_data$train, main = paste0("NVDA - train data - ", nrow(split_data$train), " samples"))
plot(split_data$validate, main = paste0("NVDA - validate data - ", nrow(split_data$validate), " samples"))
plot(split_data$test, main = paste0("NVDA - test data - ", nrow(split_data$test), " samples"))
I will explore using exponential smoothing via the ets function from the forecast package on the stationary training data.
(ets_fit <- ets(split_data$train))
## ETS(A,N,N)
##
## Call:
## ets(y = split_data$train)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 6e-04
##
## sigma: 0.0411
##
## AIC AICc BIC
## 7950.517 7950.523 7969.474
From the model output above we can see that only the alpha smoothing parameter was estimated, implying that beta, gamma, and phi are not informative for this time series.
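ETS(A,N,N) is simple exponential smoothing: the level updates as l_t = alpha * y_t + (1 - alpha) * l_{t-1}, and every h-step-ahead forecast is just the final level. A hand-rolled sketch of the recursion, for intuition only:

```r
# The ETS(A,N,N) recursion: each new observation nudges the level by alpha.
ses_by_hand <- function(y, alpha, l0) {
  level <- l0
  for (obs in as.numeric(y)) level <- alpha * obs + (1 - alpha) * level
  level # the h-step-ahead forecast, for any h
}
# With the tiny fitted alpha (1e-04), the level barely reacts to new data,
# so forecasts of the differenced series are nearly flat.
ses_by_hand(split_data$train, alpha = 1e-04, l0 = 6e-04)
```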
checkresiduals(ets_fit)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 17.114, df = 8, p-value = 0.02894
##
## Model df: 2. Total lags used: 10
The Ljung-Box p-value (0.029) is actually marginally significant at the 0.05 level, suggesting some autocorrelation remains in the residuals. Still, given the lack of strongly significant lags in the ACF and a roughly normal histogram of residuals centered around 0, the model does a reasonable job of capturing the available information for forecasting.
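For reference, checkresiduals runs the Ljung-Box test via stats::Box.test; the same numbers can be reproduced directly, with fitdf = 2 accounting for the two estimated ETS parameters (alpha and the initial level):

```r
# Ljung-Box on the ETS residuals: 10 lags, 2 fitted parameters,
# matching the "Model df: 2. Total lags used: 10" line above.
Box.test(residuals(ets_fit), lag = 10, type = "Ljung-Box", fitdf = 2)
```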
Let’s make some predictions with the ets model.
ets_preds <- forecast(ets_fit, h=nrow(split_data$validate))
autoplot(ets_preds)
As you might have guessed, making predictions on our transformed data does not by itself yield useful price forecasts.
Let’s un-transform the predictions and plot them alongside the actual data.
untrans <- function(x, trans_ts, og_data) {
xi = og_data[1:trans_ts$lag] %>% trans_ts$func()
x <- x %>% diffinv(lag = trans_ts$lag, xi = xi) %>% trans_ts$inv_func() %>% as.vector()
x[2:length(x)]
}
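A quick round-trip check shows that untrans really does invert the transformation: feeding it the full transformed series should recover the original closing prices.

```r
# untrans: diffinv() undoes the differencing, then the inverse transform
# returns to the original scale. The imputed leading zero in trans_ts$diff
# makes the lengths line up.
recovered <- untrans(as.numeric(trans_ts$diff), trans_ts, trans_ts$og)
all.equal(recovered, as.numeric(trans_ts$og))
```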
ets_preds_xts <- data.frame(
preds = untrans(ets_preds$mean, trans_ts, split_data$og_validate),
Close = split_data$og_validate$Close
) %>%
xts(order.by = index(split_data$og_validate), frequency = 365.25)
hchart(ets_preds_xts$Close, color = "#9671e5", name = "Closing price") %>%
hc_title(text = paste0(stock_symbol, " - Comparing ETS predictions to actual price")) %>%
hc_add_theme(thm) %>%
hc_tooltip(crosshairs = FALSE, valueDecimals = 2) %>%
hc_add_series(data = ets_preds_xts$preds, color = "#e571a9", name = "ETS predictions")
We can clearly see that the ets model produces a generally bullish forecast that follows the actual price early on but diverges from it fairly quickly.
Let’s calculate the RMSE using the rmse function from the package Metrics.
(rmse_ets <- rmse(ets_preds_xts$preds, ets_preds_xts$Close))
## [1] 32.07502
Let’s compare this RMSE to the RMSE of our model on the raw data (i.e., not transformed to be more stationary).
ets_preds_og <- forecast(ets_fit, h=nrow(split_data$og_validate))
ets_preds_xts_og <- data.frame(
preds = ets_preds_og$mean,
Close = split_data$og_validate$Close
) %>%
xts(order.by = index(split_data$og_validate), frequency = 365.25)
(rmse_ets_og <- rmse(ets_preds_xts_og$preds, ets_preds_xts_og$Close))
## [1] 38.86881
It looks like using the transformed data results in a noticeably better RMSE.
However, this result isn’t really how I intend to use the model. Instead, my intent is to forecast only up to five days ahead to inform decisions in options trading. To test the model adequately for this purpose, I will continuously update the input data and forecast five days out in a walk-forward validation approach.
n_days <- 5
ets_preds <- c()
cat("Fitting ETS models:\n")
## Fitting ETS models:
for(i in 0:(nrow(split_data$validate)-n_days)) {
if(i %% n_days == 0) cat(paste0(i, " "))
ets_model <- ets(rbind.xts(split_data$train, split_data$validate[0:i]))
preds <- forecast(ets_model, h = n_days)
if(i == 0) {
xi <- split_data$og_train[(length(split_data$og_train)-trans_ts$lag):length(split_data$og_train)]
} else {
xi <- split_data$og_validate[i:(i+trans_ts$lag)]
}
untrans_preds <- untrans(preds$mean, trans_ts, xi)
ets_preds <- c(ets_preds, untrans_preds[n_days])
}
## 0 5 10 15 20 25 30 ... 1155 1160 1165
Now that we have collected the predictions from each of our five-day-out forecasts, let’s plot them and compare to the actual price.
n <- length(split_data$og_validate$Close)
back_val_ets_preds_xts <- data.frame(
preds = ets_preds,
Close = split_data$og_validate$Close[n_days:n]
) %>%
xts(order.by = index(split_data$og_validate)[n_days:n], frequency = 365.25)
hchart(back_val_ets_preds_xts$Close, color = "#9671e5", name = "Closing price") %>%
hc_title(text = paste0(stock_symbol, " - Comparing ETS predictions to actual price in validate set")) %>%
hc_add_theme(thm) %>%
hc_tooltip(crosshairs = FALSE, valueDecimals = 2) %>%
hc_add_series(data = back_val_ets_preds_xts$preds, color = "#e571a9", name = "ETS predictions")
Zooming in to take a closer look, it appears that this rolling-window five-day forecast simply produces a time series that mirrors the actual data shifted five days into the future. This amounts to a fairly naive forecast, as we could effectively recapitulate these results by just carrying the current closing price forward as our five-day-out forecast.
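That observation is easy to quantify by scoring the naive benchmark directly: carry today’s close forward as the five-day-ahead forecast and compute its RMSE on the same dates.

```r
# Naive benchmark: today's close as the 5-day-ahead prediction.
# If the walk-forward ETS RMSE is close to this, the model adds little value.
naive_preds <- as.numeric(split_data$og_validate$Close[1:(n - n_days + 1)])
actual <- as.numeric(split_data$og_validate$Close[n_days:n])
rmse(actual, naive_preds)
```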
Let’s calculate the RMSE so we can compare this model to others later.
(rmse_ets <- rmse(back_val_ets_preds_xts$preds, back_val_ets_preds_xts$Close))
## [1] 2.303449
rmse_df <- data.frame(model = "ets", val_rmse = rmse_ets)
I will now calculate predictions using the walk forward validation technique on the test set.
train_val <- rbind.xts(split_data$train, split_data$validate)
n_days <- 5
ets_preds_test <- c()
cat("Fitting ETS models:\n")
## Fitting ETS models:
for(i in 0:(nrow(split_data$test)-n_days)) {
if(i %% n_days == 0) cat(paste0(i, " "))
ets_model <- ets(rbind.xts(train_val, split_data$test[0:i]))
preds <- forecast(ets_model, h = n_days)
if(i == 0) {
xi <- split_data$og_validate[(length(split_data$og_validate)-trans_ts$lag):length(split_data$og_validate)]
} else {
xi <- split_data$og_test[i:(i+trans_ts$lag)]
}
untrans_preds <- untrans(preds$mean, trans_ts, xi)
ets_preds_test <- c(ets_preds_test, untrans_preds[n_days])
}
## 0 5 10 15 20 25 30 ... 570 575 580
Now let’s plot it.
n <- length(split_data$og_test$Close)
back_val_ets_preds_xts_test <- data.frame(
preds = ets_preds_test,
Close = split_data$og_test$Close[n_days:n]
) %>%
xts(order.by = index(split_data$og_test)[n_days:n], frequency = 365.25)
hchart(back_val_ets_preds_xts_test$Close, color = "#9671e5", name = "Closing price") %>%
hc_title(text = paste0(stock_symbol, " - Comparing ETS predictions to actual price in test set")) %>%
hc_add_theme(thm) %>%
hc_tooltip(crosshairs = FALSE, valueDecimals = 2) %>%
hc_add_series(data = back_val_ets_preds_xts_test$preds, color = "#e571a9", name = "ETS predictions")
As you could have guessed, we get a similar result to what we saw in the validate set.
Let’s compute the RMSE, this time on the test-set predictions.
(rmse_ets_test <- rmse(back_val_ets_preds_xts_test$preds, back_val_ets_preds_xts_test$Close))
I will now use the auto.arima function, also from the forecast package. As the name suggests, this function determines the appropriate values for (p, d, q) in building the ARIMA model. I will create a wrapper function so I can have a Boolean option for quick_search which will use the default parameters, otherwise it will set stepwise and approximation to FALSE. This performs a more exhaustive search but takes much longer. I will use the AICc as the value for determining best fit in case the sample size is small for a relatively new stock (not the case for NVDA but that is fine).
fit_auto_arima <- function(data, quick_search = FALSE) {
if(quick_search) {
return(auto.arima(data, ic = "aicc"))
}
auto.arima(data, ic = "aicc", stepwise = FALSE, approximation = FALSE)
}
(quick_auto <- fit_auto_arima(split_data$train, quick_search = TRUE))
## Series: data
## ARIMA(2,0,2) with zero mean
##
## Coefficients:
## ar1 ar2 ma1 ma2
## -1.1077 -0.7835 1.1376 0.7945
## s.e. 0.1393 0.1409 0.1369 0.1411
##
## sigma^2 = 0.00169: log likelihood = 7271.58
## AIC=-14533.15 AICc=-14533.14 BIC=-14501.56
(slow_auto <- fit_auto_arima(split_data$train))
## Series: data
## ARIMA(3,0,2) with zero mean
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2
## -0.5110 -0.9473 0.0403 0.5326 0.9608
## s.e. 0.0229 0.0166 0.0160 0.0169 0.0139
##
## sigma^2 = 0.001683: log likelihood = 7280.7
## AIC=-14549.39 AICc=-14549.37 BIC=-14511.48
The slow_auto fit produced a smaller AICc, so we will use this approach.
Let’s look at the model residuals.
checkresiduals(slow_auto)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(3,0,2) with zero mean
## Q* = 7.327, df = 5, p-value = 0.1974
##
## Model df: 5. Total lags used: 10
We can again see, from the non-significant p-value of the Ljung-Box test, the non-significant lags in the ACF plot, and the normally distributed residuals centered on zero, that our model is doing a good job of capturing the available information.
Let’s produce a naive forecast using our model and plot the predictions alongside the test set data. I will again un-transform the predictions and validation data to the actual price.
arima_preds <- forecast(slow_auto, h = nrow(split_data$validate))
arima_preds_xts <- data.frame(
preds = untrans(arima_preds$mean, trans_ts, split_data$og_validate),
Close = split_data$og_validate$Close
) %>%
xts(order.by = index(split_data$og_validate), frequency = 365.25)
hchart(arima_preds_xts$Close, color = "#9671e5", name = "Closing price") %>%
hc_title(text = paste0(stock_symbol, " - Comparing ARIMA predictions to actual price")) %>%
hc_add_theme(thm) %>%
hc_tooltip(crosshairs = FALSE, valueDecimals = 2) %>%
hc_add_series(data = arima_preds_xts$preds, color = "#e571a9", name = "ARIMA predictions")
It looks like the ARIMA model captures the general bullish trend but under-forecasts in the long run.
Let’s calculate the RMSE.
(rmse_slow_auto <- rmse(arima_preds_xts$preds, arima_preds_xts$Close))
## [1] 34.68691
This RMSE is worse than the naive RMSE using ets.
Let’s again use the same walk-forward validation approach to compile five-day-out forecasts. To save on computation time I will only re-fit the model every fifth day.
n_days <- 5
arima_preds <- c()
cat("Fitting ARIMA models:\n")
## Fitting ARIMA models:
for(i in 0:(nrow(split_data$validate)-n_days)) {
if(i %% n_days == 0) {
cat(paste0(i, " "))
arima_model <- fit_auto_arima(rbind.xts(split_data$train, split_data$validate[0:i]))
}
preds <- forecast(arima_model, h = n_days)
if(i == 0) {
xi <- split_data$og_train[(length(split_data$og_train)-trans_ts$lag):length(split_data$og_train)]
} else {
xi <- split_data$og_validate[i:(i+trans_ts$lag)]
}
untrans_preds <- untrans(preds$mean, trans_ts, xi)
arima_preds <- c(arima_preds, untrans_preds[n_days])
}
## 0 5 10 15 20 25 30 ... 1155 1160 1165
Now let’s plot our predictions and compare to the actual price.
n <- length(split_data$og_validate$Close)
back_val_arima_preds_xts <- data.frame(
preds = arima_preds,
Close = split_data$og_validate$Close[n_days:n]
) %>%
xts(order.by = index(split_data$og_validate)[n_days:n], frequency = 365.25)
hchart(back_val_arima_preds_xts$Close, color = "#9671e5", name = "Closing price") %>%
hc_title(text = paste0(stock_symbol, " - Comparing ARIMA predictions to actual price")) %>%
hc_add_theme(thm) %>%
hc_tooltip(crosshairs = FALSE, valueDecimals = 2) %>%
hc_add_series(data = back_val_arima_preds_xts$preds, color = "#e571a9", name = "ARIMA predictions")
It’s no surprise that we see the same sort of result as with our ets model. The five-day forecast price is essentially just the current closing price.
Let’s calculate the RMSE too.
rmse_slow_auto <- rmse(back_val_arima_preds_xts$preds, back_val_arima_preds_xts$Close)
to_add <- data.frame(model = "arima", val_rmse = rmse_slow_auto)
(rmse_df <- rbind(rmse_df, to_add))
## model val_rmse
## 1 ets 2.303449
## 2 arima 2.312512
The RMSE is very similar to the ets RMSE.
Seeing as this approach is so similar to ets, I will forgo walk-forward validation on the test set, as I don’t plan on using ARIMA going forward anyway.
Prophet is an open source additive regression algorithm for time series forecasting developed by Meta. It is intended to work best with seasonal data, as it fits non-linear trends with yearly, weekly, and daily seasonality components. One benefit of Prophet is that we do not have to transform the data to be more stationary first, as the algorithm is built to work out of the box on raw time series data.
prep4prophet <- function(data, ds, y) {
rename(data, ds = !!ds, y = !!y) %>%
dplyr::select(ds, y) %>%
mutate(ds = as.POSIXct(ds, origin = ds[1]))
}
proph_train <- prep4prophet(
split_data$og_train %>%
as.data.frame() %>%
rownames_to_column("date"),
"date", "Close"
)
proph_fit <- prophet(proph_train)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
The message about daily seasonality is due to our time series being reported daily, which is not granular enough for that type of seasonality (it refers to within-day cycles).
Now that the model is fit, let’s make some predictions and explore how the model did.
proph_val <- prep4prophet(
split_data$og_validate %>%
as.data.frame() %>%
rownames_to_column("date"),
"date", "Close"
)
proph_predict <- predict(proph_fit, proph_val)
prophet_plot_components(proph_fit, proph_predict)
From the seasonality components above we can see a slow general trend upwards over the years and a yearly pattern that dips in the summer months but rebounds through the fall into winter. The weekly seasonality component is suspect in my opinion, as our data has no points for Saturdays or Sundays, so Prophet may be imputing these (which it does out of the box), artificially inflating the weekend effect.
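That suspicion is easy to check: the frame passed to Prophet should contain no weekend dates at all.

```r
# Count rows per weekday in the Prophet training frame; Saturdays and
# Sundays should be absent, so any weekend effect in the weekly component
# comes from Prophet's own handling of missing days, not from the data.
table(weekdays(as.Date(proph_train$ds)))
```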
plot(proph_fit, proph_predict)
The plot above clearly shows a general trend upwards, with ever-expanding confidence intervals the further out the forecast goes.
val_proph_preds <- split_data$og_validate %>%
as.data.frame() %>%
mutate(preds = as.vector(proph_predict$yhat)) %>%
as.xts()
hchart(val_proph_preds$Close, color = "#9671e5", name = "Closing price") %>%
hc_title(text = paste0(stock_symbol, " - Comparing Prophet predictions to actual price")) %>%
hc_add_theme(thm) %>%
hc_tooltip(crosshairs = FALSE, valueDecimals = 2) %>%
hc_add_series(data = val_proph_preds$preds, color = "#e571a9", name = "Prophet predictions")
Comparing the long forecast to the validation data shows much the same result as the ARIMA and ets result.
Let’s calculate the RMSE on the naive approach to validation.
(rmse_proph <- rmse(val_proph_preds$Close, val_proph_preds$preds))
## [1] 33.31312
It is much the same as we have already seen.
Now let’s employ the same walk forward validation strategy.
n_days <- 5
prophet_preds <- c()
cat("Fitting Prophet models:\n")
## Fitting Prophet models:
for(i in 0:(nrow(split_data$validate)-n_days)) {
if(i %% n_days == 0) {
cat(paste0(i, " "))
proph_train <- prep4prophet(bind_rows(
split_data$og_train %>%
as.data.frame() %>%
rownames_to_column("date"),
split_data$og_validate[0:i] %>%
as.data.frame() %>%
rownames_to_column("date")
), "date", "Close")
proph_model <- prophet(proph_train) %>% suppressMessages()
}
proph_val <- prep4prophet(
split_data$og_validate[i:(i + n_days)] %>%
as.data.frame() %>%
rownames_to_column("date"),
"date", "Close"
) %>% dplyr::select(ds)
proph_predict <- predict(proph_model, proph_val)
prophet_preds <- c(prophet_preds, proph_predict$yhat[n_days])
}
## 0 5 10 15 20 25 30 ... 1155 1160 1165
And again, plot the results.
back_val_proph_preds <- data.frame(
preds = prophet_preds,
Close = split_data$og_validate$Close[n_days:n]
) %>%
xts(order.by = index(split_data$og_validate)[n_days:n], frequency = 365.25)
hchart(back_val_proph_preds$Close, color = "#9671e5", name = "Closing price") %>%
hc_title(text = paste0(stock_symbol, " - Comparing Prophet predictions to actual price")) %>%
hc_add_theme(thm) %>%
hc_tooltip(crosshairs = FALSE, valueDecimals = 2) %>%
hc_add_series(data = back_val_proph_preds$preds, color = "#e571a9", name = "Prophet predictions")
This is certainly an improvement from the naive forecast but results in some bizarre behavior. It seems like prophet is unable to handle the strong trend in the non-transformed data.
rmse_proph <- rmse(back_val_proph_preds$preds, back_val_proph_preds$Close)
to_add <- data.frame(model = "prophet", val_rmse = rmse_proph)
(rmse_df <- rbind(rmse_df, to_add))
## model val_rmse
## 1 ets 2.303449
## 2 arima 2.312512
## 3 prophet 13.843493
I will again forgo walk-forward validation on the test set, as I don’t plan on using Prophet going forward.
See the knn_ts.html report for more.
See the lstm_rnn.html report for more.
While it has been a very educational process to explore these three time series forecasting models, I do not think they apply well to my intended use case. These models are able to naively capture the short-term trend (on the order of months), but they are not well suited for the five-day-out forecasts I am interested in.
For the sake of this project I will use the exponential smoothing model using the ets function in building an ensemble model as it has the lowest RMSE.
I have explored two other models in python for time series forecasting using NVDA stock price data.
I made a custom kNN time series algorithm in this report and an LSTM-RNN multivariate model in this report.
I will now run each of these three models on the test set using the walk-forward validation technique and combine their predictions.
n_days <- 5
# ets
t_dat <- rbind(split_data$og_train, split_data$og_validate, split_data$og_test)
trans_ts <- determine_trans(t_dat)
ets_preds <- c()
cat("Predicting ETS ...\n")
## Predicting ETS ...
for(i in 0:(nrow(split_data$test)-n_days)) {
ets_fit <- ets(rbind.xts(split_data$train, split_data$validate, split_data$test[0:i]))
ets_pred <- forecast(ets_fit, h=5)
if(i == 0) {
xi <- split_data$og_validate[(length(split_data$og_validate)-trans_ts$lag):length(split_data$og_validate)]
} else {
xi <- split_data$og_test[i:(i+trans_ts$lag)]
}
untrans_preds <- untrans(ets_pred$mean, trans_ts, xi)
ets_preds <- c(ets_preds, untrans_preds[n_days])
}
# knn_ts
cat("Predicting kNN ...\n")
## Predicting kNN ...
knn_preds <- c()
for(i in 0:(nrow(split_data$og_test)-n_days)) {
data <- rbind.xts(split_data$og_train, split_data$og_validate, split_data$og_test[0:i])
data <- data[,1] %>% r_to_py()
pred <- knn_ts(data, 15, 8, 2, 5)
knn_preds <- c(knn_preds, pred$predictions[5])
}
# lstm-rnn
cat("Predicting LSTM-RNN ...\n")
## Predicting LSTM-RNN ...
start <- index(split_data$og_test) %>% head(1)
end <- index(split_data$og_test) %>% tail(1)
lstm_preds <- ensemble_lstm(start, end)
# compile
preds_df <- data.frame(
ets = ets_preds[32:length(ets_preds)],
knn = knn_preds[32:length(knn_preds)],
lstm = lstm_preds
)
preds_df$mu <- rowMeans(preds_df)
preds_df$Close <- split_data$og_test$Close[36:length(split_data$og_test$Close)] %>% as.vector()
index <- index(split_data$og_test$Close[36:length(split_data$og_test$Close)])
preds_df <- xts(preds_df, order.by = index)
head(preds_df)
## ets knn lstm mu Close
## 2020-02-26 78.82343 88.52962 77.50250 81.61852 66.73260
## 2020-02-27 77.31983 83.31903 78.80744 79.81543 63.01789
## 2020-02-28 73.65364 72.93303 80.28202 75.62289 67.37625
## 2020-03-02 68.44397 72.04861 81.63677 74.04312 68.96294
## 2020-03-03 65.62997 61.66353 82.44392 69.91247 66.33345
## 2020-03-04 67.03316 66.40625 82.57640 72.00527 70.97871
hchart(preds_df$Close, color = "#9671e5", name = "Closing price") %>%
hc_title(text = paste0(stock_symbol, " - Comparing ensemble predictions to actual price")) %>%
hc_add_theme(thm) %>%
hc_tooltip(crosshairs = FALSE, valueDecimals = 2) %>%
hc_add_series(data = preds_df$mu, color = "#e571a9", name = "Ensemble predictions")
It looks like the ensemble model performed reasonably well on the test set.
Let’s look at the RMSE.
(rmse_test <- rmse(preds_df$Close, preds_df$mu))
## [1] 14.04808
That is a decent RMSE all things considered.
Let’s look at the RMSE of each of the sub-models.
# ETS
rmse(preds_df$Close, preds_df$ets)
## [1] 12.78775
# kNN
rmse(preds_df$Close, preds_df$knn)
## [1] 18.19214
# LSTM-RNN
rmse(preds_df$Close, preds_df$lstm)
## [1] 18.24644
It looks like the ETS model performs best. This is kind of discouraging, since the five-day ETS forecast was basically just an extension of the current price. What these results tell me is that predicting stock prices is very hard, and my simple approaches do not come close to a good answer.
Let’s wrap this in an ensemble model function and predict into the future.
ensemble_model <- function(data) {
n_days <- 5
# ets
trans_ts <- determine_trans(data)
cat("Predicting ETS ...\n")
ets_fit <- ets(trans_ts$diff)
ets_pred <- forecast(ets_fit, h=5)
xi <- data[(length(data)-trans_ts$lag):length(data)]
ets_pred <- untrans(ets_pred$mean, trans_ts, xi) %>% tail(1)
# knn_ts
cat("Predicting kNN ...\n")
data <- data[,1] %>% r_to_py()
pred <- knn_ts(data, 15, 8, 2, 5)
knn_pred <- pred$predictions[5]
# lstm-rnn
cat("Predicting LSTM-RNN ...\n")
start <- index(data) %>% head(1)
end <- index(data) %>% tail(1)
preds <- ensemble_lstm(start, end)
lstm_pred <- preds %>% tail(1)
# compile
preds_df <- data.frame(
ets = ets_pred,
knn = knn_pred,
lstm = lstm_pred,
mu = mean(c(ets_pred, knn_pred, lstm_pred)) # wrap in c(): mean() ignores extra arguments
)
preds_df
}
(pred <- ensemble_model(xts_nvda$Close))
## Predicting ETS ...
## Predicting kNN ...
## Predicting LSTM-RNN ...
## ets knn lstm mu
## 1 198.8834 195.9811 198.2005 197.6883
Seeing as we are using all the available data up to April 29th, 2022, this prediction is for May 6th, five trading days later. The ensemble predicts a price of $197.69, but time will tell how accurate this prediction really is.
I have learned a lot doing this project. The main takeaway for me is that these naive approaches do not serve my intended use well, namely forecasting the closing stock price five days out to inform options trading strategies.
I think the LSTM-RNN approach could be improved upon with future work but that would require many more hours of gathering and bringing in associated data (maybe Google Trends, Consumer Price Index, etc.), exploring new model architectures, and fine tuning parameters.